After building production multi-agent pipelines with Claude as the primary reasoning engine, I've found a set of patterns that rarely appear in the official docs. The current lineup is Opus 4.8 (launched May 28, 2026), Sonnet 4.6, and Haiku 4.5 — all with 1M token context windows. Here is what actually matters for keeping costs down and pipelines stable.

1. The Claude Model Lineup in June 2026

Opus 4.8
$5 / $25 per 1M tokens (input / output). Complex reasoning, long-horizon agents, agentic coding.
Sonnet 4.6
$3 / $15 per 1M tokens (input / output). Preferred over Opus 4.5 in 59% of Claude Code tests.
Haiku 4.5
Cheapest and fastest. Routing, classification, triage, bulk labeling.

Opus 4.8 pricing at $5/$25 is a 67% reduction from the Opus 4/4.1 era ($15/$75). The Batch API gives an additional 50% off across all models for async work. Cache reads are billed at roughly 10% of the standard input rate. The discounts multiply: a cache read sent through the Batch API is billed at about 5% of the standard input rate, so stacking them correctly cuts per-run cost substantially.
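To see how the discounts stack, here is a minimal estimator. It is a sketch, not a billing tool: the `Rates` class and `estimate_cost` function are names I made up for illustration, the multipliers are the rounded figures quoted above, and cache-write surcharges are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    input_per_mtok: float   # USD per 1M uncached input tokens
    output_per_mtok: float  # USD per 1M output tokens


OPUS_4_8 = Rates(5.00, 25.00)  # prices quoted in this section

CACHE_READ_MULTIPLIER = 0.10   # "roughly 10%" of the input rate
BATCH_MULTIPLIER = 0.50        # Batch API discount


def estimate_cost(rates, uncached_in, cached_in, out, batch=False):
    """Estimate the USD cost of one request (cache writes not modeled)."""
    cost = (
        uncached_in * rates.input_per_mtok
        + cached_in * rates.input_per_mtok * CACHE_READ_MULTIPLIER
        + out * rates.output_per_mtok
    ) / 1_000_000
    return cost * BATCH_MULTIPLIER if batch else cost


if __name__ == "__main__":
    # 50K cached context, 2K fresh input, 1K output
    print(round(estimate_cost(OPUS_4_8, 2_000, 50_000, 1_000), 4))
    print(round(estimate_cost(OPUS_4_8, 2_000, 50_000, 1_000, batch=True), 4))
```

Under these assumptions, a request that would cost about $0.285 with no caching drops to about $0.06 with a warm cache, and to about $0.03 when it also goes through the Batch API.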

Note: Claude Mythos Preview also exists as of June 2026, but it is invitation-only and used for defensive cybersecurity research, so it is not relevant for standard production use.

2. Prompt Caching: The Biggest Cost Lever

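Prompt caching lets Claude reuse the beginning of a request if that prefix is identical to a previously cached version. Cache reads cost roughly 10% of standard input tokens. On a pipeline that passes a 50K-token shared codebase context to 6 agents, that context alone is 300K input tokens per run: about $1.50 uncached at Opus 4.8's $5 per 1M rate, versus about $0.15 when every call hits a warm cache. The first call that writes the cache is billed at a premium, so a cold run lands somewhat higher.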
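The exact figures depend on the model and on whether the cache is already warm. A rough check of the arithmetic for the shared context, assuming the $5 per 1M Opus 4.8 input rate, a 1.25x surcharge for the initial 5-minute cache write (my assumption, based on Anthropic's published pricing as I understand it), and agents that run one after another so only the first call writes:

```python
INPUT_RATE = 5.00 / 1_000_000  # USD per input token (Opus 4.8 rate quoted above)
CACHE_WRITE = 1.25             # assumed multiplier for a 5-minute cache write
CACHE_READ = 0.10              # "roughly 10%" multiplier for cache reads


def shared_context_cost(context_tokens, agents, cached, warm=False):
    """USD spent on the shared context across one pipeline run."""
    if not cached:
        return context_tokens * agents * INPUT_RATE
    writes = 0 if warm else 1  # a cold run pays for one cache write
    reads = agents - writes
    multiplier = writes * CACHE_WRITE + reads * CACHE_READ
    return context_tokens * INPUT_RATE * multiplier


if __name__ == "__main__":
    print(round(shared_context_cost(50_000, 6, cached=False), 4))
    print(round(shared_context_cost(50_000, 6, cached=True), 4))
    print(round(shared_context_cost(50_000, 6, cached=True, warm=True), 4))
```

With these assumptions the shared context costs about $1.50 uncached, about $0.44 on a cold cached run, and about $0.15 once the cache is warm.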

As of February 5, 2026, caching uses workspace-level isolation. Caches are scoped per workspace, not per organization — relevant if you share an org with multiple teams.

Python (Anthropic SDK)
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-opus-4-8-20260528",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": LARGE_SYSTEM_PROMPT,  # 50K tokens of stable context
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": dynamic_user_message}],
    )

    # Check cache status
    cache_creation = response.usage.cache_creation_input_tokens  # tokens written to the cache
    cache_read = response.usage.cache_read_input_tokens  # tokens served from the cache

3. The Two Cache TTLs (and Why 1 Hour Matters)

Anthropic now offers two TTL options: 5 minutes (default) and 1 hour. Writing to the 1-hour cache costs 2x the base input rate, versus 1.25x for the 5-minute cache, but it is usually the right default for agent workloads. Extended thinking tasks can take longer than 5 minutes to complete, which means a 5-minute cache can expire before the next agent turn even starts. For long-running pipelines, request the 1-hour TTL.

Python: 1-hour cache TTL
    system=[
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            "cache_control": {
                "type": "ephemeral",
                "ttl": "1h",  # 1 hour instead of the default "5m"
            },
        }
    ]
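Whether the 1-hour TTL pays off depends on how far apart your turns are. The sketch below compares the two options for a given sequence of gaps between calls. The function name is hypothetical, the write multipliers (1.25x and 2x of base input) are my assumptions about current pricing, and the model assumes every cache hit refreshes the TTL.

```python
WRITE_MULT = {"5m": 1.25, "1h": 2.0}  # assumed cache-write cost, relative to base input
TTL_MINUTES = {"5m": 5, "1h": 60}
READ_MULT = 0.10                      # cache reads at roughly 10% of base input


def prefix_cost_units(gaps_minutes, ttl):
    """Cost of the cached prefix over a conversation.

    The unit is "one uncached pass over the prefix". `gaps_minutes` holds
    the idle time before each follow-up call.
    """
    cost = WRITE_MULT[ttl]  # the first call always writes the cache
    for gap in gaps_minutes:
        if gap < TTL_MINUTES[ttl]:
            cost += READ_MULT        # hit: cheap read, TTL refreshed
        else:
            cost += WRITE_MULT[ttl]  # expired: pay for a fresh write
    return cost


if __name__ == "__main__":
    slow_turns = [7, 8, 6]  # minutes between calls, e.g. long thinking steps
    fast_turns = [1, 1, 1]
    for name, gaps in (("slow", slow_turns), ("fast", fast_turns)):
        for ttl in ("5m", "1h"):
            print(name, ttl, round(prefix_cost_units(gaps, ttl), 2))
```

With turns 6 to 8 minutes apart, the 5-minute cache rewrites on every call (about 5.0 units) while the 1-hour cache stays warm (about 2.3 units). With turns a minute apart, the 5-minute cache is the cheaper option, so it is worth measuring your actual gaps.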

4. Cache Invalidation: The Traps That Kill Your Hits

Cache hits require an identical prefix. Anything that changes the prefix between requests invalidates the cache silently. The most common trap: putting a timestamp in your system prompt.

Never do this: prepending "Today's date is 2026-06-08T14:32:11Z" to your system prompt invalidates the cache on every single request. Move timestamps to the user message, not the system prompt. If you need date grounding, truncate to the day at most and keep it in the dynamic user turn.

Other things that invalidate the prefix even when your actual task prompt is identical: changing tool definitions, toggling extended thinking on or off between requests, adding or removing images, and changing tool_choice settings. Pick settings up front and keep them stable per conversation.
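Because invalidation is silent, it helps to log a fingerprint of everything that has to stay identical and alert when it changes. This is a minimal sketch: `prefix_fingerprint` and `build_user_turn` are hypothetical helpers, and the set of fields hashed is simply the list of invalidators named above.

```python
import hashlib
import json
from datetime import date


def prefix_fingerprint(tools, system, thinking, tool_choice=None):
    """Short hash of the request parts that must not change between calls.

    Key order is preserved on purpose, so reordering fields also changes
    the fingerprint.
    """
    payload = json.dumps(
        {
            "tools": tools,
            "system": system,
            "thinking": thinking,
            "tool_choice": tool_choice,
        }
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def build_user_turn(task, today):
    """Keep date grounding in the dynamic user turn, truncated to the day."""
    return {"role": "user", "content": f"Current date: {today.isoformat()}\n\n{task}"}


if __name__ == "__main__":
    tools = [{"name": "read_file", "description": "Read a file from disk"}]
    thinking = {"type": "enabled"}
    stable = prefix_fingerprint(tools, "You are a code reviewer.", thinking)
    drifted = prefix_fingerprint(
        tools, "Today's date is 2026-06-08T14:32:11Z. You are a code reviewer.", thinking
    )
    print(stable == drifted)  # False: the timestamp changed the prefix
    print(build_user_turn("Audit src/main.js", date(2026, 6, 8))["content"])
```

Logging the fingerprint next to `cache_read_input_tokens` makes it easy to tell a genuine cache expiry from an accidental prefix change.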

5. Tool Caching: Where to Put cache_control

When you pass a tools array, put cache_control on the last tool in the array. Claude caches the prefix up to and including that marker. If you have more than 15-20 tools, consider deferred tool loading from the start — both for caching efficiency and for model performance, since the model reasons over all tool definitions on every turn.

Python: tool caching placement
    tools = [
        {
            "name": "search_web",
            "description": "...",
            "input_schema": {...},
        },
        {
            "name": "read_file",
            "description": "...",
            "input_schema": {...},
            "cache_control": {"type": "ephemeral"},  # on the LAST tool
        },
    ]
    # cache_control on the last tool caches the entire tools prefix
    # stable tools array = cache hits on every call
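If several agents assemble their tools array dynamically, a small helper can enforce the placement rule so the marker never ends up on the wrong entry. The helper name `with_cache_breakpoint` is hypothetical; this is one way to do it, not an SDK feature.

```python
import copy


def with_cache_breakpoint(tools):
    """Return a copy of `tools` with cache_control on the final entry only."""
    marked = copy.deepcopy(tools)  # never mutate the caller's definitions
    for tool in marked:
        tool.pop("cache_control", None)  # drop stray markers from earlier entries
    if marked:
        marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


if __name__ == "__main__":
    base_tools = [
        {"name": "search_web", "description": "Search the web", "input_schema": {"type": "object"}},
        {"name": "read_file", "description": "Read a file", "input_schema": {"type": "object"}},
    ]
    for tool in with_cache_breakpoint(base_tools):
        print(tool["name"], "cache_control" in tool)
```

Keep the order of the tools stable as well. Sorting by name before calling the helper is a cheap way to guarantee that.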

6. Extended Thinking: budget_tokens is Deprecated

Extended thinking is available on Opus 4.8 and Sonnet 4.6. The API changed on Opus 4.7+: budget_tokens is deprecated. The current parameter is effort, which takes "low", "medium", or "high" instead of a token count. Anthropic manages the allocation internally.

Python: current extended thinking API
    # OLD (deprecated on Opus 4.7+, will break)
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # no longer accepted
    }

    # CURRENT (Opus 4.8, Sonnet 4.6)
    response = client.messages.create(
        model="claude-opus-4-8-20260528",
        max_tokens=16000,
        thinking={
            "type": "enabled",
            "effort": "high"  # low | medium | high
        },
        messages=[{"role": "user", "content": complex_audit_prompt}]
    )

    for block in response.content:
        if block.type == "thinking":
            pass  # internal reasoning chain
        elif block.type == "text":
            final_answer = block.text

Important: toggling thinking on and off between turns invalidates prompt caching for the message history. Decide at the conversation level whether thinking is on, and keep it consistent throughout that conversation.

Thinking blocks get cached as part of the request content when you pass them back in tool use conversations. During tool use, return thinking blocks to the API unmodified along with your tool result.
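In practice, "unmodified" means appending the assistant's content blocks exactly as received, then answering with the tool results in the next user turn. A minimal sketch with plain dicts follows. `append_tool_turn` is a hypothetical helper and the block contents are placeholders; with the SDK you would pass `response.content` as `assistant_content`.

```python
def append_tool_turn(messages, assistant_content, tool_results):
    """Return a new history with the assistant turn and its tool results.

    `assistant_content` is passed through untouched, thinking blocks included.
    `tool_results` maps each tool_use id to that tool's output string.
    """
    return messages + [
        {"role": "assistant", "content": list(assistant_content)},
        {
            "role": "user",
            "content": [
                {"type": "tool_result", "tool_use_id": tool_use_id, "content": output}
                for tool_use_id, output in tool_results.items()
            ],
        },
    ]


if __name__ == "__main__":
    history = [{"role": "user", "content": "Audit src/main.js"}]
    assistant_content = [
        {"type": "thinking", "thinking": "(placeholder)", "signature": "(placeholder)"},
        {"type": "tool_use", "id": "toolu_example", "name": "read_file",
         "input": {"path": "./src/main.js"}},
    ]
    history = append_tool_turn(history, assistant_content, {"toolu_example": "file contents"})
    print([m["role"] for m in history])
```

The point of the helper is what it does not do: it never filters, truncates, or reorders the assistant's blocks before sending them back.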

7. Tool Use That Does Not Break Agents

Three rules that prevent infinite tool loops in multi-agent systems:

Python: correct error tool result
    tool_result = {
        "type": "tool_result",
        "tool_use_id": tool_use_block.id,
        "content": "Error: file not found at path ./src/main.js",
        "is_error": True,  # tells Claude this call failed
    }
    messages.append({"role": "user", "content": [tool_result]})
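One way to produce that error shape consistently is to route every tool call through a single wrapper, so an exception can never escape as a missing or malformed result. This is a sketch: `run_tool` and the example handler are hypothetical names.

```python
def run_tool(tool_use_id, handler, tool_input):
    """Run one tool call and always return a well-formed tool_result block."""
    try:
        output = handler(**tool_input)
    except Exception as exc:  # any failure is reported back to the model
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"Error: {exc}",
            "is_error": True,
        }
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": str(output)}


def read_file(path):
    """Example handler that always fails, to show the error path."""
    raise FileNotFoundError(f"file not found at path {path}")


if __name__ == "__main__":
    print(run_tool("toolu_example", read_file, {"path": "./src/main.js"}))
```

Catching `Exception` broadly is deliberate here: the model can often recover from a clear error message, but it cannot recover from a tool call that never gets an answer.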

8. Context Distillation Between Agents

All current Claude models have a 1M token context window. That does not mean you should use it. In a 6-agent pipeline where each agent passes its full conversation to the next, you are burning tokens and slowing every call. The correct pattern: extract only the structured output at each step and pass that forward.

Node.js: distill, never pass full conversation
    const researchBrief = await researcher.run(goal);
    const architectPlan = await architect.run({
      goal,
      research: researchBrief.summary,     // ~500 tokens
      sources: researchBrief.keyFindings,  // ~300 tokens
      // NOT: researchBrief.fullConversation  // 40,000 tokens
    });
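The same idea in Python, with a guard that fails loudly if a handoff grows past a budget. This is a sketch under assumptions: `distill` is a hypothetical helper, the field names mirror the snippet above, and the character limit is only a crude stand-in for a token count.

```python
import json


def distill(brief, fields=("summary", "keyFindings"), max_chars=4000):
    """Keep only the named fields of an agent's output for the next agent."""
    handoff = {name: brief[name] for name in fields if name in brief}
    size = len(json.dumps(handoff))
    if size > max_chars:
        raise ValueError(f"handoff is {size} chars, budget is {max_chars}")
    return handoff


if __name__ == "__main__":
    research_brief = {
        "summary": "Auth middleware skips token expiry checks.",
        "keyFindings": ["expiry not validated", "no rate limiting on /login"],
        "fullConversation": "x" * 160_000,  # the part that must not be forwarded
    }
    print(sorted(distill(research_brief)))
```

Raising instead of truncating is a design choice: a silently clipped summary is harder to debug than a failed run.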

9. Model Tiering: Use Haiku Where You Do Not Need Opus

Use Haiku 4.5 for routing and classification (which agent handles this? does this input look valid?). Use Sonnet 4.6 for the main agentic work. Reserve Opus 4.8 for the steps that genuinely need deep reasoning: architecture planning, multi-pass security audit, complex code generation. Switching from all-Opus to tiered routing cuts per-run LLM costs by 60% in most pipelines.
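A tiering table can be as simple as two dictionaries. In this sketch the task names and the `pick_model` helper are hypothetical, and the Sonnet 4.6 model ID is an assumption because only the Opus and Haiku IDs appear in this article, so check it against your account's model list.

```python
MODEL_BY_TIER = {
    "triage": "claude-haiku-4-5-20251001",
    "agent": "claude-sonnet-4-6",  # assumed ID, verify before use
    "deep": "claude-opus-4-8-20260528",
}

# Task types are illustrative; anything not listed falls back to the agent tier.
TIER_BY_TASK = {
    "routing": "triage",
    "classification": "triage",
    "input_validation": "triage",
    "architecture_planning": "deep",
    "security_audit": "deep",
    "complex_codegen": "deep",
}


def pick_model(task_type):
    """Map a task type to a model ID, defaulting to the mid tier."""
    return MODEL_BY_TIER[TIER_BY_TASK.get(task_type, "agent")]


if __name__ == "__main__":
    for task in ("routing", "refactor_module", "security_audit"):
        print(task, "->", pick_model(task))
```

Defaulting unknown tasks to the middle tier keeps new agent steps from silently landing on the most expensive model.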

10. Batch API for Non-Urgent Work

The Batch API processes requests asynchronously within 24 hours at a flat 50% discount on all input and output tokens. Documentation generation, data classification, evaluation runs, pre-computed analysis — anything that does not need a real-time response should go through the Batch API.

Python: batch request
    import time

    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": f"doc-{i}",
                "params": {
                    "model": "claude-haiku-4-5-20251001",
                    "max_tokens": 1024,
                    "messages": [{"role": "user", "content": doc_prompt}],
                },
            }
            for i, doc_prompt in enumerate(doc_prompts)
        ]
    )

    # Results are only available once processing has ended, so poll first
    while client.messages.batches.retrieve(batch.id).processing_status != "ended":
        time.sleep(60)

    for entry in client.messages.batches.results(batch.id):
        if entry.result.type == "succeeded":
            print(entry.custom_id, entry.result.message.content[0].text)

"Context window management is not a technical problem. It is an architecture decision. Make it intentionally."

Key Takeaway: In June 2026, budget_tokens is deprecated; use effort. The 1-hour cache TTL exists now and is the right default for agents. Never put timestamps in system prompts. Place cache_control on the last tool in your array. These four changes alone fix the most common production Claude bugs.